1. INTRODUCTION

The internet has become an integrated platform in which one of the most prominent aspects is the 
commercialization of social media networks (SMNs) [1], [2]. Many platforms have been established since the 
release of Web 2.0 such as media sharing (e.g., YouTube), blogs (e.g., Twitter), wikis (e.g., Wikipedia), 
social news (e.g., Reddit), and social networking (e.g., Facebook). Such platforms are used for content 
generation, information dissemination, and interactive communications [3]. SMNs are widely used in 
different fields of people's life such as health, education, marketing, and business. Such networks have 
become habitual in many work areas [4]. Recently, businesses have taken advantage of SMNs to increase social 
interaction and drive growth [5]. People may share their ideas, perspectives, and views on several different 
topics with other users using these platforms. This may indicate that a huge amount of data is generated on 
social media sites (SMSs) where millions of users communicate with each other using these networks [6]. 
Networks refer to a set of relationships between a group of customers, suppliers, agents, and similar social 
connections [5]. SMNs allow access to individuals (customers, users, or companies) who share information 
about their social properties such as images, videos, and texts. This can increase their popularity and elicit 
the reactions and perspectives of other people [4], [7]. Users of SMNs also share their ideas, daily activities, and 
opinions [8]. 

The extended network includes a user's friends of friends, although there is no direct relationship 
between them [9], [10]. A user can create a profile on SMNs such as Facebook for free to communicate with 
different people [11]. A user’s profile is considered a social identity. A profile consists of a basic description 
about the user such as name, age, gender, education, marital status, email, phone number, and location. 
Earlier literature has exploited this kind of information to provide recommendations for users [12]-[14]. 

Information that describes users' behavior on SMNs is called users' preferences, such as 
accepting new friends, following sports news, sharing posts, looking for new friendships, searching 
for a new job, and accepting requests from other people. Users may share similar preferences and interests in 
SMNs, which could be analyzed to discover the nearest users [15]-[17]. However, searching for people, 
products, industries, businesses, or jobs is not a trivial task due to the huge amount of data generated on 
SMNs every day [7]. For example, if a user searches for a person nearby who shares a similar profile and/or 
preferences, the user needs to search posts and users’ profiles. On the other hand, SMNs may suggest nearby 
people based on several different features of similarity. 

Although the suggestions produced in earlier literature offer an easy, low-effort way to 
build an individual network, a user may not be willing to interact with such results. Users are 
probably unfamiliar with how SMNs suggest the nearest people, which may be 
based on either profiles or preferences. For example, a group of users may match 
in terms of age, gender, and education, yet each user has his/her own preferences, 
such as supporting a particular football team, accepting new friendships, and allowing his/her posts to be shared. 
Thus, it might be difficult to group users together given these differences. To address this issue, users’ 
profiles and preferences can be exploited together to search for the nearest users. This study, therefore, 
contributes to the body of research by proposing an approach for discovering the nearest users based on users’ 
profiles and preferences. Proposing this hybrid approach can address the limitations in previous literature and 
provide more reliable results. It can also help recommend items that fit users' needs based on the preferences 
of other close users. 

The rest of the paper is organized as follows. In section 2, previous approaches are reviewed and 
analyzed. In section 3, the proposed approach is presented. Section 4 presents the key findings of this study. 
Section 5 discusses the research findings, whereas section 6 concludes the key concepts of this research and 
the possible future directions. 


2. RELATED WORKS 

A review of previous approaches that have been used for discovering and matching users’ profiles in 
SMNs is presented here. The discussed literature is grouped into two categories namely, approaches that 
relied on users’ profiles and approaches that relied on users' preferences. Regarding the first category, 
Raad et al. [18] proposed a framework for matching users’ profiles that refer to the same person on two SMNs. 
The suggested approach included two phases: i) profiles were converted into an extensible markup 
language (XML) file and ii) a set of profile attributes was compared using a similarity function. If the 
similarity score of two profiles is higher than a profile matching threshold, both profiles are taken to 
belong to the same user. In another research study [19], an approach for matching profiles was suggested to 
discover whether different profiles belong to the same user. Screen names on Facebook were analyzed by removing 
nicknames, converting names into lowercase, splitting on spaces, and matching the extracted names to the 
corresponding screen names on Twitter. Thelwall [20] analyzed users’ profiles with a focus 
on users’ age and gender more than other features such as location, status 
(i.e., married, single), and the number of registered friends. Although such frameworks were able to identify 
profiles on two SMNs that refer to the same user, users’ preferences were ignored. Another research study 
exploited users’ profiles on SMNs to address the cold start issue by considering users' demographic data 
such as gender, age, and occupation [21]. In the same context, [22] also used demographic information of users to 
compute the similarity weight of the users' behavior values and their rating value. In another study by 
Senefonte et al. [23], users' profiles on social media sites were considered to predict their mobility as tourists. 
In the proposed approach, users were grouped using fuzzy C-means and self-organizing maps. 

Regarding the second category, [24] investigated users' preferences based on their location to 
recommend, for example, the most appropriate restaurants. The study automatically identified user 
preferences based on their location history and opinions. Similarly, Bao et al. [25] provided a systematic 
review of the methods used in recommender systems based on SMNs. 
Jamil et al. [26] proposed an approach for identifying the nearest users by measuring the similarity of 
preferences. Users with the same preferences were grouped and recommended to a particular user. 
Kim et al. [9] also suggested an approach for recommending similar users based on preferences. This approach 
consists of three tasks: i) clustering users’ preferences; ii) measuring similarity between two users 
on a scale ratio; and iii) providing suggestions. Another research study proposed a model based on measuring 
the similarity between users’ preferences [15]. Finally, Torrijos et al. [27] suggested a method 
for discovering related users in location-based social networks. The method assumes that 
users are more similar when they visit the same place at the same time and share similar preferences. 
However, the properties of users’ profiles are ignored in such studies. 

It can be seen that the reported literature did not take into consideration both users’ profiles and 
users' preferences. In addition, some of the reported approaches are either designed for identifying the same 
user from different networks or for matching profiles in terms of users’ names only. However, users’ age, 
gender, education, and preferences are considered essential properties of a user which have to be taken into 
consideration when discovering the nearest users to a particular user [28], [29]. This research attempts to 
address this issue in previous work by proposing an approach for discovering the nearest users based on both 
profiles and preferences. 


3. THE PROPOSED APPROACH 

The proposed approach consists of three key stages. First, users are grouped based on their 
preferences. Second, the nearest users are identified by measuring the boundary distance of the grouped users 
using the property values of profiles and preferences. Finally, the best group of nearest users is identified 
by measuring the similarity between them. 


3.1. The research datasets 

Two datasets were used: the Book-Crossing dataset [30] and the personality dataset [31]. 
The Book-Crossing dataset consists of 278,858 users with 1,149,780 ratings, as well as 271,379 books. It was 
generated with three main tables, namely BX-Users, BX-Books, and BX-Book-Ratings. 

The BX-Users table consists of userID and age to represent users' profiles. The BX-Books table 
includes ISBN, book-title, book-author, year-of-publication, and publisher in which ISBN represents a book 
identity. The BX-Book-Ratings table captures users’ preferences through book rating, userID, and ISBN. The dataset 
is available in two formats: a structured query language (SQL) dump [26.391 Kb] and a comma-separated values 
(CSV) dump [25.475 Kb]. In this study, the SQL dump was downloaded and converted into a SQL Server 
database; the dump contains the schema information and data insertion statements and can be run 
as a SQL script. 

Ten users were selected randomly from the dataset, and the proposed approach was applied 
to them. However, the results of only four users are reported in this research as an example of the 
performance of the proposed model. Table 1 shows a description of the selected users from the Book-Crossing 
dataset. The users' profiles include userID and age, whereas the number of preferred books and the 
average rating are considered as users’ preferences. 


Table 1. A description of the selected users

User profile features      User preferences features
userID      Age            Number of preferred books      Average rating
10          26             2                              3
2,276       46             498                            3
7,400       22             5                              7
13,877      27             4                              6
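
The two preference features in Table 1 can be derived from rating rows. The following is a minimal sketch, assuming that "number of preferred books" means the number of books a user has rated and that ratings arrive as (userID, ISBN, rating) tuples; the function name `preference_features` and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def preference_features(ratings):
    """Return {user_id: (number_of_rated_books, average_rating)}.

    `ratings` is an iterable of (user_id, isbn, rating) tuples, mirroring
    the columns of the BX-Book-Ratings table (assumed layout).
    """
    totals = defaultdict(lambda: [0, 0])  # user_id -> [count, sum of ratings]
    for user_id, _isbn, rating in ratings:
        totals[user_id][0] += 1
        totals[user_id][1] += rating
    return {uid: (count, total / count) for uid, (count, total) in totals.items()}


# Hypothetical rows for illustration only.
sample_ratings = [
    (1, "isbn-a", 4),
    (1, "isbn-b", 2),
    (2, "isbn-a", 7),
]
features = preference_features(sample_ratings)
```

In this toy input, user 1 has rated two books with an average of 3.0, which has the same shape as the first row of Table 1.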


The personality dataset consists of 1,834 users with 1,028,751 ratings, as well as 198,117 movies. It 
was generated with two main tables namely, personality data and ratings. Personality data consists of userID 
and the five personality types which are scored from 1 to 7. The five personality types are openness, 
agreeableness, emotional stability, conscientiousness, and extraversion. Openness evaluates users’ attitudes to 
choose a new experience. Agreeableness assesses users’ behavior toward others such as compassion, 
cooperation, suspicion, and antagonism. Emotional stability evaluates users’ attitudes to psychological stress. 
Conscientiousness evaluates users’ behavior such as organization, dependability, and self-discipline. 
Extraversion assesses users' attitudes to be outgoing [32]. The rating table contains the ratings of users. Ten 
users were also selected randomly from the dataset, but the results are reported here for four users only to 
show the implementation of the proposed approach. Table 2 presents a description of the selected users from 
the personality dataset. 

Table 2. Description of the selected users from the personality dataset

UserID                              Openness   Agreeableness   Emotional stability   Conscientiousness   Extraversion   Average rating
8e7cebf9a234c064b75016249f2ac65e    5          2               2                     2.5                 6.5            3.48
8f45de0d6ded99d3c81f32100d08ab85    3.5        4               5.5                   2.5                 4.5            3.65
abb3365a8b5a8ed98497110974d21089    5          4               3                     3                   2              3.50
cdc60474a84852f28f42441686869c05    7          6               3.5                   2                   2.5            3.79


3.2. Users' grouping 

In SMNs, users post their views or interact with others by sharing information based on their 
preferences. In this stage, they are grouped based on their preferences such as desires or interests. A user may 
prefer watching action movies or family movies, or following documentary reports. Other users may share similar 
preferences. Accordingly, the similarity of preferences is used for grouping users. For example, it can be 
assumed that user X prefers action movies and follows documentary reports, whereas user Y prefers family 
movies and follows documentary reports. Thus, |X ∩ Y| = 1, the shared preference being documentary reports. 

Hence, X and Y can be grouped, along with the number of movies that they liked or tagged, as shown in 
Algorithm 1. It is worth mentioning that the number of preferences (e.g., X prefers six documentary reports) is 
not used for grouping users here; it is used in the next stage for measuring the boundary distance between 
the grouped users. G refers to a group of users who share similar preferences, whereas P refers to a 
preference. For example, if X ∩ Y = {P1, P2, P3}, X ∩ Z = {P8}, and X ∩ W = {}, then G = {X, Y, Z}. 


Algorithm 1: Grouping users
Input: User X, List of Users L
Output: List of Grouped Users
1: Px ⇐ {list of preferences of X}
2: G ⇐ Ø list of grouped users
3: for each user in L do
4:   Y ⇐ L
5:   Py ⇐ {list of preferences of Y}
6:   if Px ∩ Py ≠ Ø then
7:     G ⇐ Y
8:   end if
9: end for
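
The grouping step can be sketched in a few lines of code. This is an illustrative sketch rather than the implementation used in this study (which was carried out in SQL): the function name `group_users`, the dictionary layout, and the preference labels are assumptions, and the sketch returns only the users other than X.

```python
def group_users(x_prefs, others):
    """Return the users in `others` that share at least one preference with X.

    `x_prefs` is an iterable of X's preferences; `others` maps a user name
    to that user's preferences (hypothetical data layout).
    """
    px = set(x_prefs)
    grouped = []
    for user, prefs in others.items():
        # A non-empty intersection corresponds to the test in step 6.
        if px & set(prefs):
            grouped.append(user)
    return grouped


# Hypothetical preferences chosen to mirror the example in the text.
prefs_x = {"P1", "P2", "P3", "P8"}
candidates = {
    "Y": {"P1", "P2", "P3"},
    "Z": {"P8", "P9"},
    "W": {"P5"},
}
group = group_users(prefs_x, candidates)
```

Here Y and Z are grouped with X because each shares at least one preference with X, whereas W is left out because the intersection is empty.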


3.3. The boundary distance of the grouped users 

In this stage, the properties’ values of a user profile and preferences are used to identify the nearest 
users. Although a user may have more than one profile in different SMNs, the properties of each profile are 
designed and assigned based on the terms and conditions of each social media network which differ from one 
network to another. In this research, the properties of a profile are used to identify the nearest users, whereas 
different profiles of same users on different SMSs are not considered. 

The grouped users (the output of Algorithm 1) are analyzed by measuring their boundary distance to 
user X using (1). The boundary distance between user X and the grouped users is represented as the radius of a circle 
and serves as a threshold for identifying which users fall within the boundary. Hence, the 
accuracy of discovering the nearest users may be improved using boundary distance [4], [6]. 


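$$\text{Boundary Distance}(X, Y) = \frac{\sum_{i=1}^{n} (y_i - \bar{x})^2}{n} - \left( \frac{\sum_{i=1}^{n} (y_i - \bar{x})}{n} \right)^2 \qquad (1)$$

where n refers to the number of features of users X and Y, and $\bar{x}$ refers to the mean ($\mu$) of the feature values (profile properties 
and preferences) of user X, given by (2). The mean is subtracted from each feature value of Y, i.e., $(y_i - \bar{x})$. 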


$$\text{Boundary Center}(X) = \frac{\sum_{i=1}^{n} v_i}{n} \qquad (2)$$


where $v_i$ refers to the i-th feature value of user X. Y is ignored when it falls outside the boundary 
distance. Users who fall inside the boundary distance are considered the nearest users to user X, as 
shown in Algorithm 2. 


Algorithm 2: Nearest users
Input: User X, List of Grouped Users L
Output: List of Nearest Users to X
1: N ⇐ Ø list of nearest users to X
2: Bdx ⇐ Boundary of X
3: for each user in L do
4:   Y ⇐ L
5:   Bdy ⇐ Boundary Distance between X and Y
6:   if Bdy < Bdx then
7:     N ⇐ Y
8:   end if
9: end for
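
The filtering loop of Algorithm 2 can be sketched as follows. The sketch takes the distance function as a parameter and plugs in a simplified stand-in (the root-mean-square deviation of Y's feature values from the boundary center of X) rather than (1) itself. It also assumes that "Boundary of X" is the same distance applied to X's own values. Both choices, the function names, and the values for Y and Z are assumptions made for illustration; only the values for X come from Table 1 (user 10).

```python
import math


def boundary_center(values):
    """Mean of a user's feature values."""
    return sum(values) / len(values)


def rms_deviation(center, values):
    """Stand-in distance: root-mean-square deviation of `values` from `center`."""
    return math.sqrt(sum((v - center) ** 2 for v in values) / len(values))


def nearest_users(x_values, grouped, distance=rms_deviation):
    """Keep the grouped users whose distance to X's center is below X's own radius."""
    center = boundary_center(x_values)
    radius = distance(center, x_values)  # assumed meaning of "Boundary of X"
    nearest = []
    for user, values in grouped.items():
        if distance(center, values) < radius:
            nearest.append(user)
    return nearest


# X uses the row of user 10 in Table 1: age, number of books, average rating.
x_values = [26, 2, 3]
# Y and Z are hypothetical users.
grouped = {"Y": [25, 3, 4], "Z": [60, 400, 9]}
nearest = nearest_users(x_values, grouped)
```

With these values, Y stays inside the radius of X and Z falls far outside it. Any other distance with the same `(center, values)` signature can be passed through the `distance` parameter.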


3.4. Measuring similarity of the nearest users 

In the previous stage, the nearest users to X are identified by measuring boundary distance. 
Although nearest users who share preferences and properties’ values of profiles may be obtained by using the 
boundary distance as described previously, measuring the similarity between X and the nearest users is 
required. For example, it can be assumed that user X prefers 20 documentary reports, whereas the nearest 
user Y prefers 20 family movies and only 1 documentary report. User Z, on the other hand, prefers 15 
documentary reports. Hence, Z is identified as the best nearest user to X according to the similarity function 
which is known as cosine similarity (3): 


$$\text{Similarity}(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (3)$$
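
As a worked illustration of the example above, suppose each user is encoded as a two-component vector (number of documentary reports, number of family movies). This encoding is an assumption made here for illustration; applying cosine similarity to it gives:

```latex
\begin{align*}
X &= (20, 0), \quad Y = (1, 20), \quad Z = (15, 0) \\
\text{Similarity}(X, Y) &= \frac{20 \cdot 1 + 0 \cdot 20}{\sqrt{20^2 + 0^2}\,\sqrt{1^2 + 20^2}} = \frac{20}{20\sqrt{401}} \approx 0.05 \\
\text{Similarity}(X, Z) &= \frac{20 \cdot 15 + 0 \cdot 0}{\sqrt{20^2 + 0^2}\,\sqrt{15^2 + 0^2}} = \frac{300}{300} = 1
\end{align*}
```

Under this encoding, Z is far more similar to X than Y is, which agrees with Z being identified as the best nearest user.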


Algorithm 3 shows the main steps to measure similarity between X and the nearest users (i.e., within 
the boundary of X) using features’ values (profile’s properties and preferences) which represent users' desires 
and interests. In Algorithm 3, a threshold t is used to evaluate the similarity between users, where t is 
user-defined and set between 0.5 and 0.9. 


Algorithm 3: Measuring similarity
Input: User X, List of Nearest Users L, threshold t
Output: List of Similar Users to X
1: S ⇐ {Ø} list of similar users to X
2: Vx ⇐ {list of features’ values of X}
3: for each user in L do
4:   Y ⇐ L
5:   Vy ⇐ {list of features’ values of Y}
6:   CSxy ⇐ Similarity between Vx and Vy
7:   if CSxy ≥ t then
8:     S ⇐ Y
9:   end if
10: end for
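
The steps of Algorithm 3 can be sketched as below. This is an illustrative sketch, not the SQL implementation used in this study; the function names, the default threshold, the handling of all-zero vectors, and the two-component feature vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def similar_users(x_values, nearest, t=0.5):
    """Keep the nearest users whose similarity to X is at least the threshold t."""
    similar = []
    for user, values in nearest.items():
        if cosine_similarity(x_values, values) >= t:
            similar.append(user)
    return similar


# Hypothetical vectors: (documentary reports, family movies).
vx = [20, 0]
nearest = {"Y": [1, 20], "Z": [15, 0]}
result = similar_users(vx, nearest, t=0.9)
```

With a threshold of 0.9, only Z is retained, since Y's vector points in an almost orthogonal direction to that of X.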


3.5. Evaluation 

The performance of the proposed approach is evaluated based on precision. It measures the number 
of users with a similarity value greater than or equal to a threshold (between 0.5 and 0.9) divided by the number of 
users, as shown in (4): 


$$P = \frac{C}{C + I} \qquad (4)$$


where C refers to the number of users with a similarity value greater than or equal to the threshold, and I refers to the 
number of users with a similarity value less than the threshold. 
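
A minimal sketch of this precision measure is given below. The function name `precision`, the convention of returning 0.0 for an empty list, and the similarity scores are assumptions made for illustration.

```python
def precision(similarities, t):
    """Share of users whose similarity value is at least the threshold t."""
    c = sum(1 for s in similarities if s >= t)  # at or above the threshold
    i = len(similarities) - c                   # below the threshold
    if c + i == 0:
        return 0.0  # convention chosen here for an empty list
    return c / (c + i)


# Hypothetical similarity values for four discovered users.
scores = [0.95, 0.80, 0.40, 0.60]
p = precision(scores, 0.5)
```

Three of the four hypothetical users reach the threshold of 0.5, so the precision is 0.75.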


4. RESULTS AND EVALUATION 

The proposed approach is implemented using SQL and SQL Server Management Studio 2017 to store 
users’ profiles and preferences and run queries. To evaluate the ability of the proposed approach to 
discover the nearest users, three experiments were conducted, labeled Exp1, Exp2, and Exp3. Each 
experiment uses the same set of data. The experiments were conducted on ten users from each dataset, but the 
results are reported for four users only, as the other results agree with the reported findings. Therefore, repetition 
has been avoided. 

Exp1 refers to the use of the proposed approach using users’ profiles and preferences. Exp2 refers to 
the application of the proposed approach using users’ preferences only, whereas Exp3 refers to the 
implementation of the proposed approach using users’ profiles only. The findings of these three experiments 
are compared. 
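
The difference between the three experiments can be expressed as a choice of feature vector. The following sketch is illustrative only; the function name `feature_vector` and the list-based representation are assumptions. The sample values are those of user 10 in Table 1.

```python
def feature_vector(profile, preferences, experiment):
    """Select the feature values used in a given experiment."""
    if experiment == "Exp1":
        return list(profile) + list(preferences)  # profiles and preferences
    if experiment == "Exp2":
        return list(preferences)                  # preferences only
    if experiment == "Exp3":
        return list(profile)                      # profiles only
    raise ValueError(f"unknown experiment: {experiment}")


# User 10 from Table 1: age 26, two preferred books, average rating 3.
profile = [26]
preferences = [2, 3]
exp1_vector = feature_vector(profile, preferences, "Exp1")
```

The same distance and similarity computations can then be run on each vector, so that the experiments differ only in which features are supplied.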


4.1. Results of the Book-Crossing dataset 

Table 3 shows the number of nearest users discovered with Exp1 and Exp2. The number of 
users discovered with Exp2 is greater than or equal to that discovered with Exp1. However, some users may be 
discovered incorrectly according to the similarity function. Figure 1 presents a comparison between the 
outcomes of Exp1 and Exp2 in terms of the number of nearest users discovered correctly and incorrectly for 
each user in Table 1. 

Based on the findings presented in Figure 1, the proposed approach performs better with Exp1 
than with Exp2. Although the number of users discovered with Exp1 is less than with Exp2 in some cases, 
the number of nearest users discovered incorrectly in Exp1 is significantly less than the number reported 
in Exp2 in most cases, supporting the validity of the proposed approach. Figure 2 (see Appendix) shows a 
comparison between Exp1 and Exp2 for each user in terms of precision, using the reported numbers of 
users discovered correctly and incorrectly (see Figure 1). The similarity measure reported higher precision 
between 0.5 and 0.9 in most cases with Exp1. The reason behind this is that discovering the nearest users 
requires both the profiles’ properties and preferences. 


Table 3. The number of nearest users with Exp1 and Exp2

UserID    Number of nearest users with Exp1    Number of nearest users with Exp2


10 1 
2,276 2 
7,400 1 
13,877 3 

Figure 1. The number of nearest users of the first two users with Exp1 and Exp2 


4.2. Results of the personality dataset 

Table 4 shows the number of the nearest users discovered with Exp1 and Exp3. The number of 
users discovered with Exp3 is greater than or equal to that discovered with Exp1. However, some users may be 
discovered incorrectly according to the similarity function, as shown in Figure 3, which presents a 
comparison between Exp1 and Exp3 in terms of the number of the nearest users discovered correctly and 
incorrectly for each user in Table 4. 

In Figure 3, the proposed approach performs better with Exp1 than with Exp3. Although the 
number of users discovered with Exp1 is smaller than with Exp3 in some cases, the number of nearest users 
discovered incorrectly in Exp1 is less than the number reported in Exp3 in most cases. Figure 4 shows a 
comparison between Exp1 and Exp3 for each user in terms of precision, using the reported numbers of 
users discovered correctly and incorrectly (see Figure 3). It can be seen that the similarity measure reported 
higher precision between 0.5 and 0.9 in most cases with Exp1. This finding confirms that the proposed 
method is more adequate for predicting the nearest users than relying on either users' profile features or users' 
preferences only. 


Table 4. Number of nearest users with Exp1 and Exp3

userID                              Number of nearest users with Exp1    Number of nearest users with Exp3
8e7cebf9a234c064b75016249f2ac65e    34                                   20
8f45de0d6ded99d3c81f32100d08ab85    75                                   94
abb3365a8b5a8ed98497110974d21089    16                                   38
cdc60474a84852f28f42441686869c05    81                                   89

Figure 4. The results of the personality dataset for four users 


5. DISCUSSION 

The current literature has developed several different algorithms to search for similar users based on either 
users’ profiles or users’ preferences. However, considering both is necessary as it provides more reliable 
results in identifying the nearest users. Some of the reported approaches are designed either for identifying 
the same user across different networks or for matching profiles in terms of users' names only. This research, 
therefore, attempted to address this gap by proposing an approach that can discover the nearest users based 
on users' profiles and preferences. 

The proposed approach relies on three key phases. First, users are grouped based on their 
preferences (i.e., desires or interests). Second, the properties’ values of a user profile and preferences are 
used to identify the nearest users. Although some users may have more than one profile on different SMNs, 
the properties of each profile are designed and assigned based on the terms and conditions of each social 
media network which differ from one network to another. In this research study, therefore, properties of 
profiles are used to identify the nearest users, whereas the same or similar users are not considered. Third, the 
nearest users to a certain user are identified by measuring the boundary distance. 

Although nearest users that share similar preferences and property values may be discovered 
using the boundary distance, measuring the similarity between the user and the nearest users is required. 
The performance of the proposed approach was evaluated in terms of precision. It was found that the 
proposed approach achieved higher accuracy in comparison with the approaches that rely on either users’ 
profiles or users' preferences. In all cases, the number of incorrectly predicted nearest users in the proposed 
approach was less than the number of incorrectly predicted nearest users in Exp2 and Exp3, which relied on users' 
preferences or profiles, respectively. 

The proposed approach of this research is compared with the approach presented by 
Widiyaningtyas et al. [22], which identifies the nearest users by measuring the 
similarity of preferences only. Users with the same preferences were grouped and recommended to a specific 
user. The best precision result achieved in [22] was less than 60 where the similarity is greater than or equal to 
0.83. On the other hand, the proposed approach in this research achieves better results in all aspects 
based on the integration of both profiles and preferences. 


6. CONCLUSION 

Based on the research outcomes, it is recommended to use both users’ profile information and their 
preferences in SMNs or other personalized sites such as Amazon or customers' sites. This can provide better 
recommendations that may meet their actual needs effectively. This research is distinguished from previous 
works in that the proposed approach is capable of discovering the nearest users based on both profiles and 
preferences. Although the findings of the proposed approach are promising, future work in this field can be 
conducted in other directions. Further implementation of the proposed approach can be done by integrating 
Algorithm 1 with the k-means clustering algorithm to discover a group of the nearest users. Thus, the number 
of clusters can be identified dynamically. Another possible research direction is to evaluate the proposed 
approach using other datasets and implement it in several domains. 


APPENDIX 

Figure 2. The results of the Book-Crossing dataset 